Appendix
In this appendix, we first introduce the datasets and evaluation metrics used in the experiments in Section A. Then, we provide extra experimental results in Section B. In Section C, we present details of network design, training scheme, and hyper-parameter tuning.
We conduct experiments on 11 popular time series datasets: (1) Electricity Transformer Temperature [42] (ETTh1, ETTh2, ETTm1) consists of 2 years of electric power data collected from two separate counties in China. Each data point includes an "oil temperature" value and 6 power load features. The data is aggregated into 5-minute windows, resulting in 12 points per hour and 288 points per day.
A.1 Electricity Transformer Temperature (ETT)
For data pre-processing, we perform zero-mean normalization, i.e., X ... We use Mean Absolute Error (MAE) [17] and Mean Squared Error (MSE) [26] for model comparison.
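As a concrete reading of the two metrics named above, the following minimal sketch computes MAE and MSE over flat sequences of targets and predictions. The function names and the toy values are illustrative assumptions, not taken from the paper's code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y - y_hat| over all points."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (y - y_hat)^2 over all points."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example (hypothetical values, for illustration only).
targets = [1.0, 2.0, 3.0]
forecasts = [1.5, 2.0, 2.0]
mae = mean_absolute_error(targets, forecasts)
mse = mean_squared_error(targets, forecasts)
```

MSE penalizes large errors more heavily than MAE, so reporting both gives a view of typical error as well as sensitivity to outliers.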
Appendix
We held out a validation set from the training set, and used this validation set to select the L2 regularization hyperparameter, which we selected from 45 logarithmically spaced values between $10^{-6}$ and $10^{5}$, applied to the sum of the per-example losses. Because the optimization problem is convex, we used the previous weights as a warm start as we increased the L2 regularization hyperparameter. We measured either top-1 or mean per-class accuracy, depending on which was suggested by the dataset creators.
A.3 Fine-tuning
In our fine-tuning experiments in Table 2, we used standard ImageNet-style data augmentation and trained for 20,000 steps with SGD with momentum of 0.9 and cosine annealing [20] without restarts.
Each curve represents a different model.
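The hyperparameter grid and the learning-rate schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the grid endpoints are read as 10^-6 and 10^5, the schedule is assumed to decay to zero, and the base learning rate of 0.1 is a hypothetical value that the text does not specify.

```python
import math


def l2_grid(num=45, low_exp=-6.0, high_exp=5.0):
    """Return `num` logarithmically spaced values from 10**low_exp to 10**high_exp.

    The endpoints are an assumption about the range stated in the text.
    """
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]


def cosine_annealing(step, total_steps, base_lr):
    """Cosine schedule without restarts: base_lr at step 0, 0 at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Sweep from weakest to strongest regularization; in a warm-started sweep,
# the weights fitted at one value would initialize the fit at the next.
grid = l2_grid()

# Learning rate at the start, midpoint, and end of a 20,000-step run
# (base_lr=0.1 is a hypothetical choice).
schedule = [cosine_annealing(s, 20000, 0.1) for s in (0, 10000, 20000)]
```

With 45 points spanning 11 decades, the grid places four values per decade, so adjacent candidates differ by a factor of about 1.78.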
Supplementary Materials For: "Domain Adaptation with Invariant Representation Learning: What Transformations to Learn?"
Furthermore, let $\phi: \mathcal{X} \to \mathcal{Z}$ be an encoder s.t. ... Then, there is no function $\phi$ s.t. ... Let there be a subset in the invariant space $B \subseteq \mathcal{Z}$, and suppose that we have marginal invariance in the latent space: $P_S(\phi(X) \in B) = P_T(\phi(X) \in B), \forall B$. Define the pre-image of $B$ as: $A = \{a \in \mathcal{X} : \phi(a) \in B\}$. Let $A \subseteq \mathcal{X}$ be a region s.t. ...
We followed the procedure in [2], and used a mixture kernel function of $q$ RBF kernels: $\kappa(z_1, z_2) = \sum_{i=1}^{q} \eta_i \exp\{-\|z_1 - z_2\|^2 / \sigma_i^2\}$, where $\sigma_i^2$ is the kernel width of the $i$-th kernel, and $\eta_i$ is a mixing weight which we set to $1/q$.
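A minimal sketch of the mixture kernel may help make the definition concrete. It assumes the squared Euclidean distance is divided by each kernel width inside the exponent and that all mixing weights equal 1/q, as stated in the text; the function name and the bandwidth values are hypothetical.

```python
import math


def mixture_rbf_kernel(z1, z2, sigma_sq):
    """Mixture of q RBF kernels with equal mixing weights 1/q.

    z1, z2   : sequences of floats of the same length
    sigma_sq : sequence of q positive kernel widths (sigma_i^2)
    """
    if len(z1) != len(z2):
        raise ValueError("z1 and z2 must have the same dimension")
    if len(sigma_sq) == 0 or any(s <= 0 for s in sigma_sq):
        raise ValueError("sigma_sq must contain positive kernel widths")
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    eta = 1.0 / len(sigma_sq)  # mixing weight, set to 1/q
    return sum(eta * math.exp(-sq_dist / s) for s in sigma_sq)


# Hypothetical bandwidths spanning several scales.
widths = [0.5, 1.0, 2.0, 4.0]
k = mixture_rbf_kernel([0.0, 1.0], [1.0, 1.0], widths)
```

Mixing several widths reduces sensitivity to any single bandwidth choice, and because the weights sum to one, the kernel of a point with itself equals 1.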
The task is to determine if the sentence has positive or negative sentiment. The task is to determine whether a given sentence is linguistically acceptable or not. RTE: Recognizing Textual Entailment [2, 10, 21, 17] contains 2.5K train examples from textual entailment challenges. The fine-tuning costs are the same as for BERT plus relative position encodings, as the same Transformer model is used.